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ABSTRACT 


— ~~>Tbe  past  decade  has  seen  the  publication  of  a  larfe  number  of  books  and  papers  on  die 
analysis  of  multi-way  contingency  tables  using  loglinear  and  logit  models.  The  present  article 
presents  a  summary  of  die  statistical  theory  that  underlies  much  of  this  work,  and  provides 
some  linkage  to  models  and  methods  of  special  interest  to  psychometricians  and 
econometricians.  The  discussion  indudes  a  review  of  recent  and  current  research  on  die 
computation  of  maximum  likelihood  estimates  for  loglinear  and  logit  models,  especially  for 
large  multi-way  contingency  tables.  ^ 


Key  Words:  Bradley-Terry  paired  comparisons  model;  Contingency  tables;  Logit  models; 
Loglinear  models;  Rasch  model. 
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1.  INTRODUCTION 

When  elementary  statistics  texts  write  about  contingency  table  analysis  they  usually  discus 
only  two-way  tables,  and  the  Pearson  chi-square  test  for  independence.  Yet,  almost  50  years 
ago,  Bartlett  (1935)  began  statisticians  on  a  path  of  research  on  the  analysis  of  three-way  and 
higher-way  contingency  tables  that  has  branched  into  many  different  directions.  This  paper 
attempts  to  present  a  concise  summary  of  one  of  these  branches,  that  dealing  with  the  use  of 
loglinear  models  and  their  analysis  using  the  method  of  maximum  likelihood.  Some  basic 
references  for  this  literature  include  Andersen  (1980),  Bishop,  Fienberg,  and  Holland  (1975), 
Birch  (1963),  Bock  (1975,  Chapter  8).  Fienberg  (1980),  Goodman  (1978),  and  Haberman  (1974, 
1978,  1979).  Chapter  1  of  Fienberg  (1980)  provides  further  details  on  the  historical 
development  of  this  literature. 

In  Section  2,  we  summarize  the  basic  theory  associated  with  the  loglinear  model  and  its 
constrained  counterpart,  the  logit  model  We*  consider  various  sampling  structures  for 
contingency  tables,  and  make  use  of  more  general  results  on  maximum  likelihood  estimation  for 
exponential  family  distributions.  This  material  on  loglinear  models  is  linked  to  die  recent 
econometric  literature  on  retrospective  choke-based  sampling,  and  simultaneous  logit  and 
loglinear  models. 

In  Section  3,  we  illustrate  the  use ‘of  the  general  loglinear  model  results  for  two  statistical 
problems,  whose  literatures  have  evolved  separately.  More  specifically,  we  show  how  these 
non-contingency-table  problems  can  be  given  '  contingency-table-like  representations.  Then  in 
Section  4  we  briefly  summarize  the  recent  literature  on  correspondence  analysis,  and  its  links 
to  the  loglinear  model  literature. 

Finally,  in  Section  5  we  focus  on  computational  aspects  of  maximum  likelihood  estimation 
for  loglinear  models,  and  we  outline  some  current  research  efforts  on  computation  for  very 
large  contingency  tables. 
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2.  LOGLINEAR  MODELS  AND  METHODS 

For  many  examples  of  discrete-response  data  it  is  natural  to  assume  a  Poisson  or  multinomial 
distribution  for  the  observed  counts.  When  such  an  assumption  seems  reasonable  the  methods 
aasotiated  with  loglinear  models  may  be  appropriate  for  the  data.  Subject  to  the  distributional 
requirements,  the  full  power  of  die  methodology  is  available  for  any  problem  where  the 
response  variables  are  discrete  and  the  explanatory  variables  are  either  discrete,  continuous,  or 
some  combination  thereof.  Loglinear  models  are  most  commonly  applied  to  contingency  tables. 
These  arise  when  all  of  the  explanatory  variables  are  discrete,  and  the  data  can  be  represented 
as  a  p-way  cross-classification,  where  p  is  the  total  number  of  response  and  explanatory 
variables.  When  some  of  the  explanatory  variables  are  continuous  it  is  not  possible  to  display 
the  complete  data  as  a  cross-classification  but  loglinear  model  methods  are  still  appropriate. 

The  loglinear  model  is*  a  categorical  data  tool  which  closely  resembles  the  regression  and 

analysis  of  variance  (ANOVA)  models  for  continuous  data.  Regression  models  define  a 

decomposition  of  the  expected  values  of  the  data  (actually  conditional  expectations  for  responses 

given  the  values  of  certain  explanatory  variables).  Most  applications  take  a  linear 

decomposition  of  the  space  of  all  possible  expected  values.  Just  as  linear  regression  is  the 

natural  and  most  easily  manipulated  model  (contrasted  with,  say'  nonlinear  regression)  for 

% 

continuous  data,  the  natural  model* for  categorical  data  is  linear  in  the  logarithms  of  the 
expected  values.  *  This  formulation  is  mathematically  simple,  and  more  significantly,  the 
important  concepts  of  independence  and  conditional  independence  can  be  easily  expressed  in  the 
loglinear  formulation  (for  details,  see  Bishop,  Fienberg,  and  Holland,  1973). 

In  order  to  describe  the  loglinear  model  and  its  justification  we  need  to  develop  some 
notation.  Let  I  be  a  finite  index  set  and  consider  a  vector  of  observed  counts, 

x  -  (x,  ;  i  «  I  ,  xf  «  2  )  (2.1) 

which  are  considered  to  be  realizations  of  a  set  of  random  variables, 

X  *  (X(  ;  i  «  I  }.  (2.2) 

The  sat  I  indexes  the  pomible  outcomes.  The  random  variable  X  has  expectation 
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could  be  described  in  terms  of  the  index  set 
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or  by 

1*  •  IUA4A6) 

if  some  ordering  convention  is  adopted.  Suppose  that  die  cells  of  the  2X3  table  contained 

continuous  or  meenranant  data.  Thao  the  standard  ANOVA  model  for  the  (Lj)  cell  entry 

would  be:  *■ 

<  • 

'  f(X  )  ■  n  ♦  n  ♦  u  ♦  u  03) 

.IT  iu)  ayi  iaos  .  . 

wim  coostnura 

■  ?%  *  F»«.»  ■  Pw  ■ 0  • 

Le.  a  row^us-^unm-phw- interaction  affects  asodcL  When  the  entries  (Xy)  are  counts,  die 
corresponding  model  for  the  mpeetatiens  of  the  counts  is 


♦  %  *  »,*,  • 
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with  the  same  constraints  as  before,  Le.  those  hi  expremiou  06).  Setting  uf„in  *  0  for  all  i 
and  j  in  07)  hnpika  that  the  variablm-  comwponikg  to  rows  and  columns  are  independent 
In  fnmal  indapandapos  and,  in  higher  at**- conditional  independence  models  have  a 

Just  as  nonlinear  models  in  ANOVA  and  regrmaion 
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atamtioui  are  a  pomibUity,  notHogfinatr  models  for  counts  are  poaiUe  but  the  mathematics  of 
eaiiaarton  fa  more  difficult  than  for  die  loghnear  models  described  in  this  paper. 

The  statistical  literature  on  frequency  data  has  focussed  on  two  basic  sampling  schemes  and 
associated  probability  distributions  for  the  random  variables  X.  The  sampling  models,  die 
Poisson  and  product-multinomial  paradigms.  are  intimately  related  (as  product  multinomial 
sampling  can  be  obtained  as  a  conditional  version  of  Poisson  sampling),  and  both  result  hi 
theories  when  linear  restrictions  on  /<  produce  natural  models  of  interest  Extensive  accounts 
of  die  theory  of  ksglinear  models  are  available  in  Bishop.  Fienberg  and  Holland  (1975).  Plackett 
(1981).  and  especially  Haberman  (1974).  Extensions  to  more  complex  sampling  structures  can  be 
found  in  Brier  (1980).  Fellegi  (1980),  and  Rao  and  Scott  (1981). 


2.1.  THE  POISSON  MODEL 

Under  the  Poisson  model  the  dements  of  X  an  thought  of  as  independent  Poisson  random 
variables  X  with  expectation  m  *  £(x).  The  probability  density  for  this  situation  is 


„  m  *i  e"*i 
n  cJ-i - 


V 


as) 


which  results  in  the  following  log-likelihood  function  for  /•  *  ln(m)  given  x; 


*  <*,/»>  -  <m,l>  .  (19) 

The  most  natural  realization  of  this  modd  is  whexf  the  counts  represent  die  results  of 
simultaneous  independent  Poisson  procesMs.  with  means  m^  observed  over  a  fixed  period  of 
time. 


IX  PRODUCT-MULTINOMIAL  MODEL 
Let  us  pertion  the  index  set  into  r  disjoint  parts  soch  that 

i  « .  aio) 

For  each  k,  (x{  ;  i  c  Ik)  has  a  multinomial  distribution  with  mean  (m(  ;  i  <  7k  1.  In  more 
standard  parlance,  if  nk  is  the  sample  size  then  Cx(  ;  i  *  has  a  multinomial  distribution 
with  parameters  nk  and  probability  vector 
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Pk  *'  la,  /  nk  ;  i  « Ik  ).  0.11) 

The  important  distinction  is  that  here  the  sample  size  is  known  a-priori  The  full  product- 
multinomial  model  assumes  that  the  r  multinomial  distributions  so  determined  are  independent 
with  possibly  some  relationship  between  the  means  of  the  different  distributions.  Let  us 
denote  by  vk  the  indicator  vector  of  Ik,  Le. 

V  (  1  *  *  h 

\  0  otherwise.  (112) 

With  this  notation  afc  *  <»k,  x>.  The  probability  density  under  product  multinomial  sampling 
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where 


(:■) 


is  the  usual  multinomial  coefficient  The  log-likelihood  for  this  model  is 


ZM(>ipc,(nk})  =  (ln(nk!)  +  Zj  e  j  -  x.  ln(nk)  -  ln(x.!))l 
«  <pX>  , 


(114) 


subject  to  the  constraints  that  <m,r  >  «  <x,»k>  for  k  *  1  ,...,r  .  When  r  *  1  we  have 
observations  from  a  single  multinomial:  otherwise  we  have  a  set  of  r  independent  multinomials. 
The  crucial  point  to  note  is  that  the  kernel  of  both  the  Poisson  and  product-multinomial 
likelihoods  are  equivalent 
• 

Before  proceeding  with  the  theory  consider  two  examples  of  frequency  date,  one  which  can 
be  represented  as ’a  classical  contingency  table  and  one  which  cannot 

13.  TWO  EXAMPLES 

We  are  interested  in  understanding  the  factors  which  affect  academic  performance  of  a 
group  of  school  students.  The  performance  will  be  assigned  to  one  of  three  positions;  above 
average,  average,  and  below  average. 

(i)  We  select  40  students  from  each  of  4  schools  and  ascertain  the  performance  level  of  each 
student  The  resulting  data  can  be  presented  as  a  4  X  3  cross  classification  with  rows 
representing  schools  and  columns  representing  perfomance.  The  row  margins  are  fixed  by  the 


product  multinomial  sampling  design  to  be  40.  If  we  had  chosen  students  at  random  until  we 
ran  out  of  money  and  cross-classified  them  then,  the  row  margins  would  be  random.  In  this 
circumstance  the  row  margins  would  give  some  information  about  the  relative  sizes  of  the 
schools  but  would  not  otherwise  give  any  information  about  the  dependence  of  performance  on 
school.  Either  way  a  tabular  presentation  of  the  data  is  possible. 

(ii)  Consider  a  similar  experiment  except  that  now  we  take  160  students  and  record  their 
performance  and  family  income.  As  income  is  an  essentially  continuous  variable  we  cannot 
present  these  data  as  a  cross-classification  without  discarding  some  information,  but  we  might 
produce  a  table  listing  each  person,  their  performance  (the  response)  and  the  family  income 
(the  explanatory  variable). 

Of  course  for  this  type  of  study  there  are  many  potentially  useful  explanatory  variables  and 
any  well-designed  survey  would  record  some  combination  of  discrete  and  continuous  predictors. 
The  study  could  also  be  extended  by  recording  an  extra  response  variable,  e.g.  whether  the 
students’  performance  had  improved,  remained  constant,  or  declined  over  the  past  year.  One 
might  then  be  additionally  be  interested  in  the  relationship  between  the  two  response  variables 
after  the  effects  of  the  explanatory  variables  have  been  taken  into  account  Many  of  the 
questions  on  relationships  among  variables  can  be  explored  in  the  context  of  loglinear  models 
of  the  sort  described  here. 

Lest  we  be  led  to  believe  that  standard  loglinear  model  theory  is  all  powerful,  let  us 

a 

suppose  that  we  still  regard  performance  as  the  response  variable  but  that  we  had  first  selected 
SO  people  from  each  performance  category,  i.e.  we  stratified  on  the  value  of  the  response 
variable.  A  loglinear  model  may  still  be  appropriate  for  the  resulting  probabilities,  but,  as  the 
sampling  scheme  fixes  the  totals  for  the  categories  of  the  response  variable,  it  is  neither 
Poisson  nor  product-multinomial  Thus  the  standard  methods  of  frequency  data  analysis  are 
not  necessarily  appropriate.  For  a  discussion  of  tins  choice-based  paradigm  see  the  articles  by 
Manski  and  McFadden  and  by  Coslett  in  Manski  and  McFadden  (1981),  and  further  references 
therein.  We  will  restrict  ourselves  here  to  the  two  sampling  schemes  we  have  outlined  above. 
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Under  such  a  loglinear  model,  elemenary  properties  of  inner  products  can  be  invoiced  id 
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and  hence  that  Ptf t  is  a  sufficient  statistic.  In  .act  the  exponential  family  theory  assures  us 
that  Pypc  is  in  fact  a  complete  minimal  sufficient  statistic.  In  the  contingency  able  context,  M 
is  often  the  space  spanned  by  the  row  or  column  indices  of  a  table,  in  which  case  Ptf*  can  be 
obtained  from  the  row  or  column  margins  of  the  table. 

Even  more  can  be  gained  from  viewing  the  loglinear  model  in  the  exponential  family 
framework.  A  standard  result  states  that  if  the  maximum  likelihood  estimate  (MLE)  exists  it  is 
unique  and  satisfies  (i)  the  model  p  c  M,  and  (ii)  the  likelihood  equations 

Ptf  m  *  Ptf  c  ,  (120) 

Le.  the  MLE  is  obtained  by  equating  the  minimal  sufficient  statistics  with  their  expected 
values.  When  N  M.  expression  (117)  forces  expression  (115)  to  be  satisfied. 

The  likelihood  equations  for  the  two  sampling  schemes  are  identical  but  the  interpretation  of 
the  resulting  parameter  estimates  differs.  In  product-multinomial  sampling  some  of  the 
parameter  comparisons  are  meaningless  for  die  very  reason  that  the  differences  are  induced  by 
the  sampling  design. 

In  order  to  utilize  the  likelihood  equations  we  need  to  be  sure  that  the  MLE  exists.  A 
theorem  of  Haberman  (1974)  asserts  that  the  MLE  exists  if  and  only  if  there  is  a  d  «  M1,  the 

orthogonal  complement  of  M,  such  that  x  ♦  d  >  0.  Obviously  if  x  has  no  zero  .components 

*  .  % 

then  d  *  0  win  suffice.  When  there  are  observed  zeros  in  the  date  then  the  potential  for 
non-existence  of  the  MLE  is  there.  Even  when  the  MLE  does  not  exist  the  log-likelihood 

t 

function  is  still  convex.  The  only  problem  is  that  the  maximum ‘lies  ouside  the  allowable 
parameter  space,  Le.  some  components  of  ^  are  -oo  at  the  maximum.  This  corresponds  to  an 
estimated  mean  value,  nj,  of  zero  or  equivalently  to  a  zero  estimated  probability.  If  one  is 
willing  to  consider  this  extended  interpretation  of  the  MLE,  then  the  MLE  always  exists  and  is 
unique.  Unfortunately,  it  is  important  to  know  about  such  singularities  for  computational 
reasons.  We  return  to  this  problem  in  Section  5  on  computation. 

In  principle  estimation  for  loglinear  models  is  quite  simple  -  see  section  5  on  computing  for 


10 

further  details.  Without  some  indication  of  the  variances  of  the  fitted  values  (or,  equivalently, 
the  parameter  estimates),  however,  estimation  would  be  of  little  value.  Again  the  general 
theory  of  exponential  families  comes  to  our  aid.  MLE’s  in  exponential  families  have  an 
asymptotic  variance-covariance  matrix  equal  to  the  inverse  of  the  Fisher  information  matrix.  It 
is  easy  to  show  that  for  the  Poisson  likelihood  the  first  and  second  derivatives  of  the 

logliltelihood  are  x  -  m,  and  -diagim, m)  =  -  D  ,  respectively.  Using  this  notation, 

Haberman  (1974,  pp. 75-78)  has  shown  that  as  the  total  sample  size,  N,  tends  to  infinity 

i.KN'H/i-y*)]  -»  W(0^.D;')  ,  (2.21 ) 

where  m  is  the  true  mean  parameter  and  Pfa  is  the  orthogonal  projection  onto  M  with  respect 
to  the  inner  product  <.,D  .>  .  From  this  it  follows  that 

IB 

LW'Hm-  m)]  N(0,D nJ%  )  .  (122) 

By  using  these  asymptotic  normal  distributions  it  is  possible  to  directly  show  that  the  usual 
goodness-of-fit  measures, 

XJ  «  Kx(  -  m)2  /  m  (123) 

and 

G2  =  2  I  ln(x(/n|)  ,  (124) 

both  have  asymptoic  X2  distributions  with  degrees  of  freedom  equal  to  the  number  of  elements 
in  the  index  set  I  minus  the  dimension  of  M.  These  statistics  are  of  course  quite  well  known; 
the  first  one,  (123),  is  just  Pearson’s  chi-square  while  the  second,  (124),  is  the  log-likelihood- 
ratio  statistic.  Thus  we  see  that  under  Poisson  or  product-multinomial  sampling  the  loglinear 
model  arises  naturally  from  consideration  of  the  likelihood  and  the  properties  of  exponential 
families.. 

Once  we  have  decided  on  a  loglinear  model  there  is  nothing  that  dictates  that  we  must  use 
maximum  likelihood  estimation.  An  alternative  estimation  strategy  would  be  to  set  up  the 
problem  in  the  Kullback-Leibler  information  theory  framework.  Using  the  result  of  Csiszlr 
(1975)  and  Kullback  (1959)  it  is  possible  to  show  that  the  natural  Kullback-Leibler 
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representation  is  in  fact  the  convex  dual  of  the  maximum  likelihood  representation  and  that  the 
estimated  means  from  either  approach  are  equivalent  For  further  details  and  references  see 
Meyer  (1982).  Alternative  estimation  strategies  which  have  appeared  in  the  statistical  literature 
include  weighted  least  squares  and  minimum  X2  methods  (e.g.  see  Grizzle.  S tanner,  and  Koch, 
1969). 

2.5.  LOGLINEAR  VERSUS  LOGIT  MODELS 

An  important  topic  that  we  have  overlooked  thus  far  is  the  relationship  of  loglinear  and 
logit  models  to  each  other.  Consider  a  vector  of  binomial  responses  x  =  (x.;  i  «  I)  and 
sample  sizes  n  =  in.;  i  «  1}  with  probability  of  success  p.  The  logit  model  postulates  the 
relationship 

logit(p)  =  ln(p/(l  p))  «  M*  .  (125) 

where  M*  is  a  linear  space  spanned  by  appropriate  explanatory  variables.  It  is  possible  to  show 
that  every  logit  model  can  be  represented  by  an  equivalent  loglinear  model  for  the  I  x  2 
contingency  table  formsd  from  the  number  of  success  and  number  of  failures.  Thus  all  of  the 
results  for  loglinear  models  carry  over  directly  to  logit  models.  However  it  is  often  much 
easier  to  interpret  models  when  they  are  presented  in  terms  of  the  log  odds  or  logit  scale. 
Even  more  important,  it  is  often  quite  inefficient  to  use  numerical  algorithms  which  are 
suitable  for  the  general  loglinear  model  in  place  of  the  more  specific  methods  appropriate  for 
logit  models.  The  numerical  efficiency  of  logit  model  methods  is  largely  due  to  the  presence 
of  sampling  constraints  which  are  implicit  in  the  logit  formulation  but  which  must  be 
considered  explicitly  in  the  loglinear  model 

Similar  comments  are  in  general  true  for  the  simultaneous  or  multinomial  logit  models  which 
are  sometimes  used  when  a  multinomial  rather  than  binomial  response  is  observed.  Consider  a 
trinomial  response  model.  The  data  we  observe  consists  of  three  vectors,  xjt  x2,  and  x3  which 
represent  say  success,  partial  success,  and  failure  with  corresponding  probabilities  p2>  and 
P3>  and  n  the  number  of  trials,  together  with  whatever  explanatory  variables  are  necessary. 
The  simultaneous  logit  model  (e.g.  see  Fienberg,  1980)  postulates  that 


and 


ln(p/P2)  €", 


(126) 


ln(p2/p,)  «  Ma  a27) 

are  simultaneously  true,  and  that 

pi  +  p2  +  ps  =  (128) 

The  multinomial  logit  model  (e.g.  see  Neilove  and  Press,  1979,  Chamberlain.  1980,  or  Manski, 
1981)  is  equivalent  to  the  simultaneous  logit  model  Por  these  models  there  is  always,  again,  an 
equivalent  loglinear  model  As  before,  it  is  the  ease  of  interpretation  and  computational 
advantages  which  make  the  simultaneous  logit  model  useful  but  the  theoretical  results  necessary 
to  manipulate  these  models  are  direct  consequences  of  the  more  general  loglinear  model 
properties.  Konig,  Nerlove,  and  Oudiz  (1981)  provide  a  detailed  economic  application  of  the 
multinomial  logit  model 

Overall  the  loglinear  model  is  a  very  powerful  and  practical  tool  for  discrete  data  analysis. 
However  it  must  be  viewed  in  much  the  same  light  as  the  general  linear  model  It  is  a  very 
general  tool  and  one  must  be  Careful  only  to  use  it  when  appropriate.  Specialized  tools  and 
models  for  certain  discrete  data  problems  are  often  more  appropriate  just  as  it  is  easier  to 
consider  the  analysis  of  designed  experiments  independently  of  the  linear  model  interpretation. 
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3.  USING  LOGLINE AR  MODELS  FOR  SOME  "NON-CONTINGENCY"  TABLE 

PROBLEMS 

The  application  of  the  loglinear  model  results  from  Section  2  to  multidimensional  contingency 
tables  focusses  on  models  where  each  set  of  the  parameters  in  the  logarithmic  scale  is 
associated  with  one  or  more  dimensions  of  the  table.  For  example,  for  a  three-dimensional 
table  and  the  loglinear  model  representing  no  second -order  interaction  (see  Fienberg.  1980). 

loffrn  )  *  u  +  u  ♦  u  ♦  u  ♦  u  ♦  u  ’  +  u  (3.1) 

tiie  term  u,uu  is  asMciated  with  dimensions  2  and  3.  and  so  on.  Other  applications  of  the 
results  will  not  necessarily  have  this  feature,  as  will  be  apparent  in  the  examples  below. 

One  of  the  values  of  general  theoretical  results  is  that  they  are  often  applicable  to  specific 
settings  beyond  those  which  led  to  the  formulation  of  the  general  structure.  This  is  certainly 
true  for  results  on  the  analysis  of  categorical  data  problems.  Fortunately  many  of  the  "non¬ 
contingency  table"  applications  of  the  loglinear  model  results  have  contingency-table-like 
representations  so  that  we  can  infrprtt  the  results  of  our  analyses  using  whatever  intuition  we 
have  gleaned  from  the  analysis  of  contingency  table  data  using  loglinear  models.  This  section 
is  based  on  a  related  discussion  in  Fienberg  (1981),  and  contains  two  examides  of  the  -jse  of 
such  contingency-table  representations. 

3.1.  THE  BRADLEY-TERRY  PAIRED  COMPARISONS  MODEL 

Early  in  the  psychometric  literature,  Thurstone  (1927a,b)  proposed  a  model  for  binary  paired 
comparisons  using  linear  model  structure,  and  this  model  was  amplified  by  Mostdler  (1931)  and 
Bock  and  Jones  (1969,  Chapters  6  and  7).  Bradley  and  Terry  (1932)  suggested  a  simple 
variation,  on  the  Thurstone  approach  which  is,  in  effect,  a  loglinear  or  logit  model  Extensions 
of  their  approach  have  been  developed  over  the  intervening  three  decades  (for  an  excellent 
review  of  this  literature  see  Bradley,  1976). 

Suppose  t  items  (e.g.,  different  types  of  chocolate  pudding)  or  treatments,  labeled  T , 
TJt...,T ,  are  compared  in  pairs  by  sets  of  judges.  (Or  suppose  that  t  football  teams  compete 


fa  pain  fa  a  scries  of  matches.)  The  Bradley-Terry  model  postulates  fast  tbe  probability  of  T( 
w  4ng  preferred  to  T  is 


PKT,  Tf 


w  ♦  w 

1  i 


u  « iA~..t, 
i  *  i . 


(3.2) 


where  each  »  St  0  and  we  add  the  constraint  that  P  .  #  ■  L  The  model  assumes 

i  i«l  1 

independence  of  tbe  same  pair  by  different  judges  and  different  pain  by  the  same  judge.  In 
the  example  of  the  football  matches  we  assume  tbe  independence  of  outcomes  of  the  matches. 

TABLE  3-1 

Layout  for  Data  in  Paired-Comparisons  Study  with  t  *  4 


For 


Against 


T, 

T1 

T, 

T« 

. 

T, 

— 

X,1 

X.» 

X|4 

Ta 

xa, 

— 

xaa 

U 

T, 

xa. 

X» 

— 

XM 

T4 

X4, 

*43* 

V 

— 

• 

In  the  typical  paired  comparison  experiment,  T  is  compared  with  ay  *  0  times,  and  we 
let  x{j  be  the  observed  number  of  times  T(  is  preferred  to  Tj  in  these  n(J  comparisons.  Table 
3-1  shows  die  typical  layout  for  the  observed  dan  when  t  *  4,  with  preference  (for.  against) 
defining  rows  and  columns.  Clearly  die  binomial  constraint, 

xu  *  xm  %  * 

is  of  the  form  of  the  multinomial  constraints  described  in  Section  2,  and  using  die  basic 
loglinear  result  on  tbe  equivalence  of  MLE’s  for  product-multinomial  and  Poisson  sampling 
.schemes,  we  can  convert  expression  (3.2)  into  a  model  for  expected  values  for  a  Poisson 
sampling  setting,  Le. 

(3.3) 


log  m  ■  M  ♦  B  ♦  y 

u  l  7ti 


IS 


with  suitable  side  constraints.  But  this,  as  was  noted  in  Pknbert  and  Laratz  (1976),  is  simply 
die  model  of  quasi-symmetry  in  a  square  continfaocy  table  (see  Bishop,  Fienberg,  and  Holland. 
1975,  Chapter  S).  The  minimal  sufficient  statistics  are 

(x^l.  (x#J>.  (Xy  ♦  x^)  OS) 

(actually  either  the  row  or  column  totals  are  redundant),  and  we  can  use  a  trick,  mggnstad  in 
Bishop.  Fienberg.  and  Holland,  to  transform  the  problem  to  one  for  a  three-way  table  of 


expected  counts.  We  generate  duplicate  tables  and  set 


and.  for  the  observed  counts, 


x  * 

i* 


k  -  1 . 
k  -  2  . 


*k  »  1  . 
k  «  2  . 


Then  the  logiiacar  version  of  the  BndJey-Terry  model  given  by  (3.3)  and  (3.4)  becomes  die 
model  of  no  second-order  interaction  in  the  new  3-dimensional  table,  whose  minimal  sufficient 
statistics  are  ((x  1.  (x  ),  (x  >).  Thus  we  can  analyze  the  fit  of  the  model  and  variations 

ii»  I** 

on  it  in  a  familiar  contingency  table  setting  of  die  sort  described  in  Section  2.  The  model  of 
expressions  (3.3)  and  (3.4)  douM  be  approeched  directly,  but  the  duplication  involved  in 
expressions  (3.6)  and  (3.7)  turns  what  is  otherwise  a  "new  loglinear"  structure  into  a 
recognizable  one  for  multi-dimensional  tables,  and  simplifies  the  computations  of  MLE’s  no 
matter  what  iterative  method  is  used  for  their  calculation. 


As  an  example,  we  reconsider  a  t  ■  4  illustration  given  by  Dykstra  (1960),  in  which  *  0 

(this  docs  not  affect  the  identification  and  estimability  of  the  Bradley-Terry  parameters).  The 
observed  data  are  given  in  part  (0  of  the  table  and  the  estimated  expected  frequencies  are 
given  in  part  (ii)  to  ooe  decimal  place.  The  goodness-of-fit  of  the  model  is  summarized  by 
the  statistics. 
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X1  =  2.0  and  O*  «  2.4  . 

and  there  are  2  degrees  of  freedom.  The  actual  maximum  likelihood  estimates  of  the  Bradley- 
Terry  parameters  are 

*l  ■  .1082  ■  4793  ,  •i  «  .2294,  and  »4  *  .1431 

as  given  by  Dykstra. 

These  results  on  the  loglinear  representation  for  the  Bradley-Terry  model  are  by  now 
reasonably  well-known,  and  they  can  be  attended  to  more  complex  settings  involving  ties, 
multiple  comparisons,  order  effects,  and  rankings  (e.g.  see  Fienberg  and  Laratz,  1976,  Fienberg, 
1979,  and  Duncan  and  Brody,  1982).  Recent  results  by  Meyer  (1981)  are  of  special  use  in 
giving  contingency  able  representations  to  some  of  these  generalizations. 


3.2.  THE  RASCH  MODEL 

We*  now  turn  to  a  problem  which  begins  with  a  representation  as  a  two-way  able  of  0’s  and 
l’s,  and  ends  up  as  a  relatively  standard  multi-dimensional  contingency  able  problem.  The 
results  of  ability  fasts  are  often  structured  in  the  form  of  sequences  of  l’s  for  correct 
answers  and  0’s  for  incorrect  answers.  For  a  test  with  k  problems  or  items  administered  to  n 
individuals,  we  let 


i  1  if  individual  i  answers  item  j  correctly 
Y  *  ] 

,j  '  0  otherwise.  (3.8) 

Thus  we  have  a  two-way  able  of  random  variables  {Y(J)  with  realizations  {y(j}.  An 

alterative  represenation  of  the  daa  is  in  the  form  of  a  nx2*  able  (W  )  where  the 

ui*V  H 

subscript  i  still  indexes  individuals  and  now  j]  refer  to  the  correctneas  of  the  reqxnses 
on  items  l,2....,k,  respectively,  Le. 


1  if  i  responds  (j^,...^) 
^ 0  otherwise. 


(3.9) 


The  simple  Rascb  model  (see  Reach,  1960  as  reprinted  in  1980:  and  Andersen,  1980,  1983)  for 


the  (Yy)  is 


Differences  of  the  form  f  -  ate  typically  described  as  measuring  the  relative  abilities  of 
individuals  i  and  r,  while  those  of  the  form  v  -  v#  are  described  as  measuring  die  relative 
difficulties  of  items  j  and  s.  Expression  (3.10)  is  a  logit  model  in  the  usual  contingency  table 
sense  for  a  3-dimensional  array  whose  first  layer  is  (y^)  and  whose  marginal  totals  adding 
across  layers  is  an  nxk  table  of  l’s.  Because  the  Rasch  model  when  viewed  as  a  model  for 
PfYy-1).  depends  on  the  item  parameters  in  a  non-linear  way,  it  is  not  at  all  char  whether  we 
can  collapse  the  array  (w  ,  , )  by  adding  over  subjects  for  estimation  purposes.  We  return 
to  this  matter  below. 


Maximum  likelihood  estimation  for  the  parameters  of  the  Rasch  model  of  expression  (3.10) 


has  been  the  focus  of  several  authors,  including  Reach  and 


Unconditional 


Hkefibood  (UML)  estimates  can  be  derived  but  they  have  rather  proMaaaatic  asymptotic 
properties,  e.g.  die  estimates  ate  inconsistent  as  a  •+  oo  and  k  remains  moderate,  although 
they  are  consistent  when  both  n  and  k  -a  oo  (Haborman.  1177). 

Before  turning  to  an  alternative  to  the  UML  approach,  we  point  out  a  recently-derived 
result  for  UML  estimates  for  the  Reach  model  which  links  up  in  yet  another  way  with 


loglinear  structures  for  rontingsnry  tables.  In  order  to  derive 
conditions  for  the  exisieace  of  UML  estimates  (a  problem  not  really 


afcd  sufficient 
for  any  of  the 


data  structures  in  this  paper).  Fischer  (1981)  embeds  the  matrix  y  ■  (y,)  into  a  larger  (n*k)x 


(n+k)  matrix  of  the  form: 


f  0  eT-yTi 

U"’*U  .  1 


where  e  is  an  nxk  matrix  of  l’s,  so  that,  for  all  (lj). 
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Then  be  notes  that  the  Rasch  model  of  expression  (3.10)  is  transformed  into  an  incomplete 
venkm  of  the  Bradley-Terry  model  of  expression  (3.2)  discussed  at  the  beginning  of  this 


section,  ie. 


Kag-1) 


i  *  k+l^..,k+n, 
j  «  1.2....,k, 


and  similarly  for  the  other  non-zoo  block  of  entries  in  A.  where 


(3.14) 


u  *  1A.~a 
i  *  r. 


(3.15) 


log  — *-  =  v  -  v 
r  J  * 


j4  «  1.2,...k, 

j  *  *. 


(3.16) 


Thus,  using  a  three-dimensional  representation  for  A  alluded  to  at  the  beginning  of  this 
section,  we  can  show  that  estimation  results  for  the  UML  approach  to  the  Rasch  model 
correspond  to  those  for  the  no-second-order  interaction  model  (see  expression  (3.1))  applied  to 
an  incomplete  three-dimensional  contingency  connsting  of  two  zero  blocks  of  dimension  kXkX2 

and  nXnx2.  and  a  duplicated  version  of  the  nXkX2  table  with  layers  y,  and  e  -  y. 

•  .  •  * . 

Now.  we  turn  to  a  conditional  approach  to  likelihood  estimation  (CML)  advocated  initially  by 
Rasch.  who  noted  that  the  conditional  distribution  of  Y  given  the  individual  marginal  totals 

a 

(V  *  yj+}  depends  only  on  the  item  parameters,  {^1.  Then  each  of  the  row  sums  (y^l  can 

take  only  k+1  distinct  values  corresponds  to  the  number  of  correct  responses.  Next,  we  recall 

the  alternate  representation  of  the  data  in  the  form  of  an  nX2k  array.  (W  ),  as  given  by 

MiV"\ 

expression  (3.9).  Adding  across  individuals  we  create  a  2k  contingency  table.  X.  with  entries 


Vv '  ™ 

Earlier,  we  asked  the  question  of  whether  we  could  work  with  this  collapsed  array.  The 
answer  is  yea,  since  all  of  the  information  we  need  to  preserve  is  the  response  pattern,  Le. 


^  **•  *"■  **  *’•  * 


I 


U|j^~*Jk)*  and  the  number  of  "correct"  responses  that  correspond  to  that  pattern.  Such 
information  allows  us  to  completely  reconstruct  the  original  matrix  of  responses.  Y,  except  for 
the  labelling  of  individuals,  and  thus  we  can  use  the  2k  array  x  to  represent  the  conditional 
distribution  of  X  given  {Y(+*y^}. 

Duncan  (1982)  and  Tjur  (1982)  independently  noted  that  we  can  estimate  the  item  parameters 


for  the  Rasch  model  of  expression  (3.10)  using  the  2k  array  x,  and  the  loglinear  model 

•  *  "wh,  ’  *  *  V‘  *  \  • 

where  the  subscript  *  P(  j^.8  ■  1  if  jg  ■  1  and  is  0  otherwise,  and 


(3.18) 


P  r  *  0  .  (3.19) 

r®  p 

More  specifically  Tjur  (1982)  shows  that  maximum  likelihood  estimation  of  the  2k  contingency 

table  of  expected  values,  m  *  (m  ).  using  a  Poisson  sampling  scheme  and  the  loglinear 

Mta 

model  of  expression  (3.18)  produces  die  conditional  maximum  xlikelihood  estimates  of  (r^)  for 
the  original  Rasch  model  Tjur  proves  this  equivalence  by  (1)  assuming  that  the  individual 
parameters  are  independent  identically  distributed  random  variables  from  some  completely 
unknown  distribution,  r;  (2)  integrating  the  conditional  distribution  of  Y  given  (Y  "y  )  over 
the  mixing  distribution,  #;  (3)  embedding  this  "random  effects”  model  in  an  "extended  random 
model";  and  (4)  noting  that  the  likelihood  for  the  extended  model  is  equivalent  to  that  for 


expremkm  (3.18)  applied  to  x. 


TABLE  3-3 


Multiplicative  Representation  of  Expected  Values  of  Model  (3.18)  for  the  Cdbe  k  ■  3 


Item  C 


Yea 

Item  A  * 
Yes  No 


No 

Item  A 
Yes  No 


Yes  abcS  bcS 


abS  bS 
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For  k*3,  the  loglinear  version  of  the  Reach  model  for  the  2*  table,  Le.  (3.18),  can  be 
represented  is  multiplicative  form  for  the  expected  values  m  as  in  Table  3-3.  The 
multiplicative  parameters  a,  b,  and  c  in  Table  3-3  correspond  to  the  ( v)  in  (3.18),  and  die 
multiplicative  parameters  IS(>  correspond  to  (r{).  The  minimal  sufficient  statistics  are 

<V>.  {x~*>  •  0.20) 

and 


*110  +  x,o,  +  x0„-  x,oo  +  xo,.  +  xoo,‘  xooo)  • 


(3.21) 


But  these  me  die  minimal  sufficient  statistics  of  the  model  of  quasi-symmetry  preserving  one¬ 


dimensional  marginal  totals  which  was  first  proposed  by  Bishop,  Fienberg.  and  Holland  (1975, 
Chapter  8).  Indeed,  the  quasi-symmetry  model  is  equivalent  to  that  of  expression  (3.18).  Thus 
following  the  prescription  of  Bishop,  Fienberg,  and  Holland  (1975,  p.305),  we  can  re-represent 
the  data  in  a  4-dimensional  redundant  form  (as  a  2x 2x2X6  table)  and  estimate  the  Reach 
model  item,  parameters  using  a  standard  logUnear  model  fitted  to  a  4-way  table  (although  not 
the  4-way  table  w  of  expression  (3.9)).  Additional  simplifications  ensue  here  because 


m  Xm  * 

^os  *  xooo  ’  (3-22) 


Duncan  (1982)  gives  several  examples  of  the  application  of  die  Reach  model  to  survey 
ressarch  problems,  and  he  presents  several  extensions  of  the  model  indicating  how  they  can  be 
represented  in  a  multi-dimensional  table  format  such  as  that  of  Table  3-3. 

Plackett  (1981),  in  a  very  brief  section  of  the  2nd  edition  of  his  monograph  on  categorical 
data  analysis,  notes  that  the  Q-stadsdc  of  Cochran  (1950)  can  be  viewed  as  a  means  of  tasting 
that  the  item  parameters  in  the  Rasch  model  are  all  equal  and  thus  zero,  le.  v  ■  0  for  all 
j.  This  observation  is  intimately  related  to  the  results  just  described,  and  our  original  data 
representation  in  the  form  of  an  nxk  (individual  by  item)  array  y  is  exactly  die  same 
representation  used  by  Cochran.  By  carrying  out  a  conditional  test  for  the  equality  of 
marginal  proportions  given  model  (3.18)  Le.  quad-symmetry  preserving  one-dimenskmal 


J*  >,*• 
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marginals,  we  get  a  test  that  is  essentially  equivalent  to  Cochran’s  test  But  this  is  also  the  teat 
for  *  0)  within  model  (3.18). 


4.  CORRESPONDENCE  ANALYSIS  FOR  CONTINGENCY  TABLE  DATA 


At  the  same  time  as  the  methods  for  loglinear  and  logit  models  were  being  developed  in  the 
United  States  and  the  United  Kingdom,  an  alternative  method  known  as  correspondence 
analysis  was  being  developed  and  practiced  by  a  group  of  French  statisticians  (e.g.  see 
Benzecri,  1973).  Correspondence  analysis  can  be  thought  of  as  a  special  case  of  canonical 
correlation  analysis,  which  is  used  in  general  to  study  linear  relationships  between  two  sets  of 
variables.  Keller  and  Wansbeek  (1983)  set  correspondence  analysis  in  the  context  of  an  errors- 
in-variables  model.  In  this  section,  we  briefly  outline  two  different  ways  that  correspondence 
analysis  can  be  used  to  analyze  contingency  table  data,  and  we  contrast  these  approaches  with 
the  loglinear  model  approach  which  is  the  main  focus  of  this  paper. 

The  first  approach  to  the  use  of  correspondence  analysis  is  as  a  technique  for  displaying  the 
rows  and  columns  of  a  two-way  contingency  table  as  points  in  corresponding  low-dimensional 
vector  spaces  (e.g.  see  Greenacre,  1981.  or  Heiser  and  Meulman,  1983).  When  the 
dimensionality  is  one  or  two,  these  spaces  are  often  superimposed  and  used  for  a  pictorial  joint 
display.  Let  x  be  an  rXc  matrix  of  counts,  and  A  and  B  corresponding  matrices  of  row  and 
column  proportions,  respectively.  Then  correspondence  analysis  finds  a  linear  mapping  between 
the  co-ordinates  f  of  the  rows  and  g  of  the  columns,  with  respect  to  the  principal  axes 
defined  by  the  following  eigen-equations: 

ATBTf  =  Xf 

BTATg  *  Xf  .  (4.1) 

The  eigenvalues  here  are  usually  arranged  in  descending  order 

'  X  2  X  2  .  .  .  2  X  2  0  .  (4.2) 

*  2  tt 

and  the  choice  of  m  *  2  allows  for  a  two-dimensional  graphical  display.  The  method  as 
outlined  here  is  limited  to  two-way  tables,  and  as  such  has  little  to  offer  for  the  analysis  of 
multi-way  tables  unless  several  variables  are  merged. 

O’Neill  (1978)  has  developed  a  formal  test  for  independence  in  a  two-way  table  using  a 
canonical  correlation  approach  and  Haberman  (1981)  has  demonstrated  its  relationship  to  a 
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simple  extension  of  the  loglinear  model  suggested  in  Fienberg  (1968)  and  developed  by 
Andersen  (1980)  and  Goodman  (1979).  The  canonical  correlation  test  extracts  the  largest 


eigenvalue.  of  the  rXr  matrix  B  with  elements 


where 


(4.3) 


x..  -  x.  x  ,/n 

d  =  _u _ t*  ii 

«  Cx,  x  .]* 


(4.4) 
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Tben  NR  2  is  then  referred  to  as  the  distribution  of  the  maximum  eigenvalue  of  an  (r-1)  X 
(r— 1)  central  Wishart  matrix  with  c-1  degrees  of  freedom.  Thus  when  r  >  2  and  c  >  2,  the 
asymptotic  distribution  of  NR 2  is  not  the  chi-squared  distribution.  Haberman  (1981)  has 
shown  that  this  test  is  equivalent  under  the  null  hypothesis  to  the  test  for  X  =  0  in  the  model 

log  m  =  u  +  u  ♦  u  +  \  v!  u1  ,  (4.5) 

6  ij  1(0  2(j)  1(0  2(j)  * 

where 


Xu  =  Zu  =  Zu/  -  Zu/  *  o  . 

i«>  Up  m  2(j> 


(4.6) 


De  Leeuw  (1983)  describes  the  interaction  structure  in  expression  (4.5)  as  bilinear.  For  a 
discussion  of  extensions  of  the  model  of  expressions  (4.5)  and  (4.6)  to  multi-way  tables  and  the 
use  of  such  models  for  ordered  categorical  data,  see  Fienberg  (1982). 


A  s  .cond  approach  to  the  yse  of  correspondence  analysis  (under  its  psychometric  name,  dual 
scaling)  for  multi-way  tables  is  presented  by  Nishisato  (1980)  who  describes  the  loglinear 
model  approach  to  multi-way  tables  as  a  form  of  analysis  of  variance  on  the  log-expected 
values.  (We  note  parenthetically  that  such  an  ANOVA  description  ignores  the  interpretation  of 
many  loglinear  models  in  terms  of  conditional  independence  of  variables  and  related  ideas,  as 
well  as  the  interpretation  of  individual  parameters  in  terms  of  odds  ratios.)  He  then  proposes 
a  pair  of  dual  scaling  or  reciprocal  approaches  in  which  the  unit  of  analysis  is  not  a  cell 
frequency  but  a  single  response  from  a  single  subject,  and  the  data  are  arrayed  in  a  two- 
dimensional  response  pattern  matrix.  Method  I  first  derives  an  optimal  response-score  vector 
7t,  and  then  subjects  it  to  a  standard  ANOVA.  Method  II  essentially  attempts  to  carry  out 
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the  two  parts  of  Method  I  simultaneously  (see  the  related  discussion  in  de  Leeuw  et  aL,  1976, 
and  Young  et  aL,  1976.  Deville  and  Sapoirta  (1983)  present  an  alternative  approach  to 
Nishisato’s  which  they  label  as  multiple  correspondence  analysis. 

Most  of  the  examples  of  these  optimal  scaling  or  correspondence  analysis  methods  applied  to 
multiway  contingency  table  data  that  appear  in  the  literature  are  not  especially  revealing.  For 
example,  Nishisato  presents  an  example  with  four  binary  explanatory  variables  and  3 
observations  for  each  of  the  2*  -  16  experimental  conditions,  and  where  there  are  8  response 
variables  each  with  5  categories.  Thus,  the  3  observations  are  spread  over  a  5*  response 
structure  and  make  the  estimation  of  a  standard  loglinear  model  virtually  impossible.  The 
heroic  and  un testable  assumptions  implicit  in  Nishisato’s  dual  scaling  analysis  need  to  be  made 
explicit  before  a  direct  comparison  with  loglinear  models  is  possible.  On  the  other  hand, 
Kester  and  Schriever  (1982)  analyze  a  standard  2X3X3X4  contingency  table  example  given  in 
Fienberg  (1980.  p.91).  A  reasonable  loglinear  model  for  the  data  in  that  table  has  as  its 
minimal  sufficient  statistics  only  two-way  marginal  totals.  Kester  and  Schrie*?^  analyst  sf 
this  example  relies  only  on  these  bivariate  two-way  totals,  and  introduces  scaling*  for  the  three 
non-binary  dimensions.  In  a  sense  their  analysis  can  be  thought  of  as  supplementing  the 
standard  loglinear  analysis. 


5.  COMPUTATIONAL  METHODS  FOR  LOGLINEAR  MODELS 

There  are  numerous  compuational  methods  which  can  be  used  to  fit  general  loglinear  models. 
For  any  particular  application  the  investigator  must  be  careful  to  select  a  method  which  is  not 
only  feasible  but  efficient  The  loglikelihood  function  for  the  loglinear  Poisson  model  is  a 
convex  function  and  when  the  MLE  exist  (i.e.  there  are  no  fitted  zeros)  it  is  a  strictly  convex 
function.  As  the  first  and  second  derivatives  of  the  loglikelihood  are  analytically  available,  a 
natural  contender  for  the  maximization  procedure  is  Newton’s  method. 

Newton’s  method  is  indeed  useful  for  loglinear  models  where  the  dimension  of  the  parameter 
space  is  small.  Using  Newton’s  method  has  the  advantage  that  the  calculations  necessary  to 
obtain  asymptotic  covariances  for  the  fitted  values  or  parameter  estimates  (i.e.  the  Fisher 
information  matrix  or  some  decomposition  of  it)  are  a  by  product  of  the  algorithm.  A  further 
advantage  is  that  in  general  Newton’s  method  can  be  expected  to  have  a  quadratic  rate  of 
convergence.  A  widely  available  implementation  of  Newton’s  method  for  discrete  data  is  in 
the  statistical  package  GLIM  (Baker  and  Nelder,  1978),  while  a  version  that  is  useful  only  for 
logit  models  can  be  found  in  the  BMDP  statistical  system.  There  are  however  several 
disadvantages  to  using  this  algorithm.  First,  fitted  zeros  can  cause  Newton’s  'method  to 
converge  much  more  slowly  than  one  would  otherwise  expect  Thus  it  is  imperative  that 
potential  singularities  in  the  data  be  detected  before  formal  fitting  is  attempted.  Fienberg, 
Meyer  and  Stewart  (1982)  ilse  a  result  of  Stewart  (1980)  to  first  screen  the  data  for  singularity 
problems  before  fitting  loglinear  or  logit  models  using  a  version  of  Newton’s  algorithm  which 
is  efficient  in  terms  of  both  time  and  storage.  Second,  a  more  limiting  problem  with 
Newton's  method  is  that  for  large  models  (greater  than  about  200  parameters)  the  amount  of 
computer  memory  required  to  store  the  second  derivative  matrix  exceeds  the  capacity  of  most 
computers.  Under  these  circumstances  a  possible  alternative  is  the  family  of  conjugate  gradient 
algorithms.  McIntosh  (1982)  has  investigated  the  possibility  of  using  conjugate-gradient  methods 
for  many  statistical  applications.  Conjugate  gradient  algorithms  have  the  property  that  they  do 
not  use  any  second  derivative  information  for  the  function  being  maximized.  This  means  that 
asymptotic  covariances  are  not  available  (other  than  in  some  special  closed-form  situations),  but 


the  method  can  consider  very  large  problems.  Whereas  Newton's  method  spends  a  lot  of 
effort  finding  a  good  search  direction,  conjugate-gradient  methods  merely  find  a  reasonable 
direction.  Thus  each  iteration  of  the  algorithm  requires  relatively  little  computation  but  the 
algorithm  is  likely  to  require  more  iterations  than  Newton's  method. 

Both  Newton’s  method  and  the  conjugate-gradient  methods  are  important  general  optimization 
tools  applicable  to  a  wide  variety  of  problems.  A  third  optimization  method  is  the  cyclic- 
coordinate  ascent  algorithm.  This  algorithm  is  not  widely  used  for  general  problems  but  is  of 
special  importance  in  logl inear  model  applications.  In  this  context  it  is  known  as  the  Iterative 
Proportional  Fitting  Procedure  (IPFP).  The  distinguishing  feature  of  IPFP  is  that  it  attempts 
to  maximize  the  likelihood  by  searching  along  a  series  of  fixed  directions,  and  thus  no 
computational  effort  is  expended  in  searching  for  good  directions.  What  makes  IPFP  attractive 
for  loglinear  model  applications  is  that  for  many  models  there  is  a  closed-form  answer  for 
maximization  along  the  fixed  directions.  Specifically,  let  m  be  the  current  vector  of  fitted 
values.  If  we  are  attempting  to  maximize  the  likelihood  along  a  vector,  fi,  which  consists  of 
only  zeros  and  ones,  then 

m  =  m-  <x,fi>  /  <m ,fi>  ,  (5.1) 

MW 

Le.  the  new  vector  of  fitted  values  is  just  a  simple  proportional  adjustment  of  the  old  vector. 
Thus  we  can  use  this  method  for  any  model  M,  which  can  be  spanned  by  a  set  of  vectors 
ksl,...4l  where  each  is  a  vector  of  only  zeros  and  ones.  For  each  direction,  there  is  a 
step  in  the  iteration  of  the  form  (5.1),  and  thus  a  complete  cycle  of  the  iteration  consists  of  a 
set  of  s  steps  along  the  s  fixed  directions  {fiY  1  corresponding  to  the  model  M.  The  class  of 
models  which  satisfy  these  requirements  is  surprisingly  large  and  includes  all  the  ANOVA-like 
loglinear  models  for  contingency  tables.  This  class  itself  includes  all  the  independence  and 
conditional  independence  models. 

The  IPFP  generally  takes  many  iterations  to  converge,  but  as  each  iteration  is  so  simple  that 
this  is  of  little  concern.  A  special  feature  of  the  IPFP  is  that  when  a  closed-form  MLE  exists 
the  algorithm  will  converge  in  at  most  two  iterations  and  one  can  assure  convergence  in  one 


iteration  by  choocing  an  appropriate  order  for  selection  of  the  spanning  vectors  (e.g.,  see 
Haberman.  1974).  The  main  advantage  of  the  IPFP  is  that  it  needs  to  store  only  the  table  of 
fitted  values  and  the  sufficient  statistics  in  computer  memory  and  thus  can  be  used  on  very 
large  problems.  The  BMDP  package  uses  this  method  in  its  contingency  table  program. 
BMDP3F. 

When  one'  has  a  large  contingency  table  problem  which  is  not  amenable  to  the  IPFP,  there 
are  several  alternatives.  An  algorithm  which  is  based  on  the  IPFP  but  which  is  useful  for  any 
model  is  the  generalized  iterative  scaling  method  of  Darroch  and  Ratcliff  (1972). 
Unfortunately  this  algorithm  has  the  slow  convergence  disadvantage  of  the  IPFP.  without  the 
important  advantage  .of  simple  computations  at  each  iteration.  For  special  applications  there 
are  several  alternatives.  Under  some  circumstances  it  is  possible  to  transform  a  contingency 
able  and  associated  models  into  a  form  where  the  simple  IPFP  can  be  used.  An  example  of 
this  is  given  in  Meyer  (1982).  For  the  simultaneous  logit  problem  it  is  possible  to  combine  the 
IPFP  with  some  other  algorithm,  e.g.  Newton’s  method  to  gain  a  useful  hybrid.  Rather  than 
consider  the  problem  as  a  large  contingency  table  one  can  fit  each  of  the  individual  logit 
models  which  comprise  the  system  and  combine  the  estimate  using  ideas  based  on  the  IPFP.  It 
is  then  necessary-  to  repeat  this  process  until  the  estimates  converge.  Some  details  on  this 
approach  are  available  in  Meyer  (1981). 

For  general  applications  of  the  loglinear  model  to  small  problems.  Newton’s  method  is 
probably  the  most  efficient  and  useful  algorithm.  As  the  size  of  the  problem  increases  the 
IPFP.  when  it  can  be  applied,  is  the  most  useful  algorithm. 
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